14 research outputs found
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research efforts seeking to automatically process facsimiles and extract information from them are multiplying, with document layout analysis as a first essential step. While the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain, among them the use of finer-grained segmentation typologies and the handling of complex, heterogeneous documents such as historical newspapers. Moreover, most approaches consider visual features only, ignoring the textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Through a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among other questions, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show a consistent improvement of multimodal models over a strong visual baseline, as well as better robustness to high material variance.
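The fusion of modalities described above can be sketched as an early, channel-wise concatenation of the page image with a spatial map of text embeddings, so that a segmentation network sees both signals at once. This is a minimal illustrative sketch only; the array shapes, embedding dimension, and the `fuse_features` helper are assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_features(image: np.ndarray, text_map: np.ndarray) -> np.ndarray:
    """Concatenate an H x W x 3 page image with an H x W x D map of word
    embeddings along the channel axis, yielding one multimodal input."""
    assert image.shape[:2] == text_map.shape[:2], "spatial grids must align"
    return np.concatenate([image, text_map], axis=-1)

# Toy example: a 4 x 4 page crop with 8-dimensional text embeddings.
image = np.random.rand(4, 4, 3)
text_map = np.random.rand(4, 4, 8)
fused = fuse_features(image, text_map)
print(fused.shape)  # (4, 4, 11)
```

Early fusion of this kind keeps the pixel-to-token alignment intact, which is what lets the segmentation network exploit textual cues at each spatial location.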
Une approche computationnelle du cadastre napoléonien de Venise
At the beginning of the 19th century, the Napoleonic administration imposed on the city of Venice a new standardised description system intended to give an objective account of the form and functions of the urban fabric. The cadastre, deployed on a European scale, offered for the first time an articulated and precise view of the structure of the city and its activities, through a methodical approach and standardised categories. Digital techniques, based in particular on deep learning, now make it possible to extract from these documents an accurate and dense representation of the city and its inhabitants.
By systematically checking the consistency of the extracted information, these techniques also evaluate the precision and systematicity of the work of the Empire's surveyors and assessors, and therefore indirectly qualify the trust to be placed in the extracted information. This article reviews the history of this computational protosystem and describes how digital techniques offer not only systematic documentation, but also prospects for extracting latent information, not yet made explicit but implicitly present in this information system of the past.
Historical newspaper semantic segmentation using visual and textual features
Mass digitization and the opening of digital libraries have given access to a huge amount of historical newspapers. In order to bring structure into these documents, current techniques generally proceed in two distinct steps: they first segment the digitized images into generic articles and then classify the text of the articles into finer-grained categories. Unfortunately, by losing the link between layout and text, these two steps cannot account for the fact that newspaper content items have distinctive visual features. This project proposes two main novelties. Firstly, it introduces the idea of merging the segmentation and classification steps, resulting in a fine-grained semantic segmentation of newspaper images. Secondly, it proposes to use textual features, in the form of embedding maps, at the segmentation step. The semantic segmentation with four categories (feuilleton, weather forecast, obituary, and stock exchange table) is performed with a fully convolutional neural network and reaches a mIoU of 79.3%. The introduction of embedding maps improves overall performance by 3% and the generalization across time and newspapers by 8% and 12%, respectively. This shows a strong potential for considering the semantic aspect in the segmentation of newspapers and for using textual features to improve generalization.
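The mIoU figure reported above is computed by taking, for each class, the intersection over union of the predicted and ground-truth label masks, then averaging across classes. A minimal sketch, with toy label maps invented purely for illustration:

```python
import numpy as np

def mean_iou(pred: np.ndarray, truth: np.ndarray, n_classes: int) -> float:
    """Mean intersection-over-union, averaged over classes that occur
    in either the prediction or the ground truth."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2 x 4 label maps with two classes (0 = background, 1 = feuilleton).
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])
pred  = np.array([[0, 0, 1, 0],
                  [0, 1, 1, 1]])
print(mean_iou(pred, truth, n_classes=2))  # 0.6
```

Averaging per-class IoU rather than per-pixel accuracy prevents large background regions from dominating the score, which matters for sparse categories such as weather forecasts or obituaries.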
Language Resources for Historical Newspapers: the Impresso Collection
Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge, and the real promise of digitization, is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this 'Big Data of the Past'. Yet, the application of text processing tools to historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the 'impresso - Media Monitoring of the Past' project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.
Datasets and Models for Historical Newspaper Article Segmentation
Dataset and models used and produced in the work described in the paper "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers": https://infoscience.epfl.ch/record/282863?ln=e
Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922
In 1839, in Paris, the Maison Didot bought the Bottin company. Sébastien Bottin, trained as a statistician, was the initiator of a high-impact yearly publication, called 'Almanachs', containing the listing of residents, businesses and institutions, arranged geographically, alphabetically and by activity typologies (Fig. 1). These regular publications enjoyed great success. In 1820, the Parisian Bottin Almanach contained more than 50,000 addresses, and until the end of the 20th century the word 'Bottin' was the colloquial term for a city directory in France. The publication of the 'Didot-Bottin' continued at an annual rhythm, mapping the evolution of the active population of Paris and other cities in France. The relevance of automatically mining city directories for historical reconstruction has already been argued by several authors (e.g. Osborne, Hamilton and Macdonald 2014, or Berenbaum et al. 2016). This article reports on the extraction and analysis of the data contained in the 'Didot-Bottin' covering the period 1839-1922 for Paris, digitized by the Bibliothèque nationale de France. We process more than 27,500 pages to create a database of 4.2 million entries linking addresses, person mentions and activities.
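Linking addresses, person mentions and activities amounts to splitting each directory record into its constituent fields. The sketch below illustrates the idea on an invented, already-transcribed entry; the record format, the regular expression and the `parse_entry` helper are illustrative assumptions, not the project's actual extraction pipeline, which operates on digitized page images.

```python
import re

# Hypothetical normalised form of a Bottin line: "name, activity, street, number".
ENTRY = re.compile(
    r"^(?P<name>[^,]+),\s*(?P<activity>[^,]+),\s*(?P<street>[^,]+),\s*(?P<number>\d+)\.?$"
)

def parse_entry(line: str) -> dict:
    """Split one directory line into name / activity / address fields."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unparsable entry: {line!r}")
    return match.groupdict()

record = parse_entry("Dupont (Jean), serrurier, r. de la Roquette, 12.")
print(record["activity"])  # serrurier
```

Structuring entries this way is what makes it possible to aggregate millions of records by street, trade or person across the 1839-1922 run.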
Repopulating Paris: massive extraction of 4 Million addresses from city directories between 1839 and 1922.
Abstract of paper 0878 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019.